The sample complexity of active learning under the realizability assumption has been well-studied. The realizability assumption, however, rarely holds in practice. In this paper, we theoretically characterize the sample complexity of active learning in the non-realizable case under multi-view setting. We prove that, with unbounded Tsybakov noise, the sample complexity of multi-view active learning can be $\widetilde{O}(\log\frac{1}{\epsilon})$, contrasting to single-view setting where the polynomial improvement is the best possible achievement. We also prove that in general multi-view setting the sample complexity of active learning with unbounded Tsybakov noise is $\widetilde{O}(\frac{1}{\epsilon})$, where the order of $1/\epsilon$ is independent of the parameter in Tsybakov noise, contrasting to previous polynomial bounds where the order of $1/\epsilon$ is related to the parameter in Tsybakov noise.



1. Introduction 



In active learning [10, 13, 16], the learner draws unlabeled data from the unknown distribution defined on the learning task and actively queries some labels from an oracle. In this way, the active learner can achieve good performance with much fewer labels than passive learning. The number of these queried labels, which is necessary and sufficient for obtaining a good learner, is well-known as the sample complexity of active learning.

Many theoretical bounds on the sample complexity of active learning have been derived based on the realizability assumption (i.e., there exists a hypothesis perfectly separating the data in the hypothesis class) [4, 6, 11, 12, 14, 16]. The realizability assumption, however, rarely holds in practice. Recently, the sample complexity of active learning in the non-realizable case (i.e., the data cannot be perfectly separated by any hypothesis in the hypothesis class because of the noise) has been studied [2, 13, 17]. It is worth noting that these bounds obtained in the non-realizable case match the lower bound $\Omega(\frac{\eta^2}{\epsilon^2})$ [19], in the same order as the upper bound $O(\frac{1}{\epsilon^2})$ of passive learning ($\eta$ denotes the generalization error rate of the optimal classifier in the hypothesis class and $\epsilon$ bounds how close to the optimal classifier in the hypothesis class the active learner has to get). This suggests that perhaps active learning in the non-realizable case is not as efficient as that in the realizable case. To improve the sample complexity of active learning in the non-realizable case remarkably, the model of the noise or some assumptions on the hypothesis class and the data distribution must be considered. The Tsybakov noise model [21] is more and more popular in theoretical analysis on the sample complexity of active learning. However, an existing result [8] shows that obtaining exponential improvement in the sample complexity of active learning with unbounded Tsybakov noise is hard.



Inspired by [23], which proved that multi-view setting [5] can help improve the sample complexity of active learning in the realizable case remarkably, we have an insight that multi-view setting will also help active learning in the non-realizable case. In this paper, we present the first analysis on the sample complexity of active learning in the non-realizable case under multi-view setting, where the non-realizability is caused by Tsybakov noise. Specifically:



- We define $\alpha$-expansion, which extends the definition in [3] and [23] to the non-realizable case, and $\beta$-condition for multi-view setting.

- We prove that the sample complexity of active learning with Tsybakov noise under multi-view setting can be improved to $\widetilde{O}(\log\frac{1}{\epsilon})$ when the learner satisfies non-degradation condition.¹ This exponential improvement holds no matter whether Tsybakov noise is bounded or not, contrasting to single-view setting where the polynomial improvement is the best possible achievement for active learning with unbounded Tsybakov noise.

- We also prove that, when non-degradation condition does not hold, the sample complexity of active learning with unbounded Tsybakov noise under multi-view setting is $\widetilde{O}(\frac{1}{\epsilon})$, where the order of $1/\epsilon$ is independent of the parameter in Tsybakov noise, i.e., the sample complexity is always $\widetilde{O}(\frac{1}{\epsilon})$ no matter how large the unbounded Tsybakov noise is. In previous polynomial bounds, by contrast, the order of $1/\epsilon$ is related to the parameter in Tsybakov noise and is larger than 1 when unbounded Tsybakov noise is larger than some degree (see Section 2). This discloses that, when non-degradation condition does not hold, multi-view setting is still able to lead to a faster convergence rate and our polynomial improvement in the sample complexity is better than previous polynomial bounds when unbounded Tsybakov noise is large.

¹ The $\widetilde{O}$ notation is used to hide the factor $\log\log(\frac{1}{\epsilon})$.

The rest of this paper is organized as follows. After introducing related work in Section 2 and 
preliminaries in Section 3, we define $\alpha$-expansion in the non-realizable case in Section 4. Then we
analyze the sample complexity of active learning with Tsybakov noise under multi-view setting 
with and without the non-degradation condition in Section 5 and Section 6, respectively, and 
verify the improvement in the sample complexity empirically in Section 7. Finally we conclude 
the paper in Section 8. 

2. Related Work 

Generally, the non-realizability of learning task is caused by the presence of noise. For learning the task with arbitrary forms of noise, Balcan et al. [2] proposed the agnostic active learning algorithm $A^2$ and proved that its sample complexity is $\widehat{O}(\frac{\eta^2}{\epsilon^2})$.² Hoping to get tighter bound on the sample complexity of the algorithm $A^2$, Hanneke [17] defined the disagreement coefficient $\theta$, which depends on the hypothesis class and the data distribution, and proved that the sample complexity of the algorithm $A^2$ is $\widehat{O}(\theta^2\frac{\eta^2}{\epsilon^2})$. Later, Dasgupta et al. [13] developed a general agnostic active learning algorithm which extends the scheme in [10] and proved that its sample complexity is $\widehat{O}(\theta\frac{\eta^2}{\epsilon^2})$.

² The $\widehat{O}$ notation is used to hide the factor $\mathrm{polylog}(\frac{1}{\epsilon})$.

Recently, the popular Tsybakov noise model [21] was considered in theoretical analysis on active learning and there have been some bounds on the sample complexity. For some simple cases, where Tsybakov noise is bounded, it has been proved that the exponential improvement in the sample complexity is possible [4, 7, 18]. As for the situation where Tsybakov noise is unbounded, only polynomial improvement in the sample complexity has been obtained. Balcan et al. [4] assumed that the samples are drawn uniformly from the unit ball in $R^d$ and proved that the sample complexity of active learning with unbounded Tsybakov noise is $O(\epsilon^{-\frac{2}{1+\lambda}})$ ($\lambda > 0$ depends on Tsybakov noise). This uniform distribution assumption, however, rarely holds in practice.



Castro and Nowak [8] showed that the sample complexity of active learning with unbounded Tsybakov noise is $\widehat{O}(\epsilon^{-\frac{2\mu\omega+d-2\omega-1}{\mu\omega}})$ ($\mu > 1$ depends on another form of Tsybakov noise, $\omega \ge 1$ depends on the Hölder smoothness and $d$ is the dimension of the data). This result is also based on the strong uniform distribution assumption. Cavallanti et al. [9] assumed that the labels of examples are generated according to a simple linear noise model and indicated that the sample complexity of active learning with unbounded Tsybakov noise is $O(\epsilon^{-\frac{2(3+\lambda)}{(1+\lambda)(2+\lambda)}})$. Hanneke [18] proved that the algorithms or variants thereof in [2] and [13] can achieve the polynomial sample complexity $\widehat{O}(\epsilon^{-\frac{2}{1+\lambda}})$ for active learning with unbounded Tsybakov noise. For active learning with unbounded Tsybakov noise, Castro and Nowak [8] also proved that at least $\Omega(\epsilon^{-\rho})$ labels are requested to learn an $\epsilon$-approximation of the optimal classifier ($\rho \in (0, 2)$ depends on Tsybakov noise). This result shows that the polynomial improvement is the best possible achievement for active learning with unbounded Tsybakov noise in single-view setting. Wang [22] introduced a smoothness assumption to active learning with approximate Tsybakov noise and proved that if the classification boundary and the underlying distribution are smooth to the $\xi$-th order and $\xi > d$, the sample complexity of active learning is $O(\epsilon^{-\frac{2d}{\xi+d}})$; if the boundary and the distribution are infinitely smooth, the sample complexity of active learning is $O(\mathrm{polylog}(\frac{1}{\epsilon}))$. Nevertheless, this result is for approximate Tsybakov noise and the assumption on large smoothness order (or infinite smoothness order) rarely holds for data with high dimension $d$ in practice.
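To make the dependence on the noise parameter concrete, the following minimal sketch evaluates the exponent of $1/\epsilon$ in two of the polynomial bounds quoted above. The function names are illustrative only, and the exponents are taken from the bounds as stated in this section; smaller $\lambda$ means heavier noise.

```python
def exponent_margin_based(lam):
    """Exponent of 1/eps in the eps^(-2/(1+lam)) bounds quoted above."""
    return 2.0 / (1.0 + lam)


def exponent_linear_noise(lam):
    """Exponent of 1/eps in the eps^(-2(3+lam)/((1+lam)(2+lam))) bound."""
    return 2.0 * (3.0 + lam) / ((1.0 + lam) * (2.0 + lam))


if __name__ == "__main__":
    # The exponent exceeds 1 once the noise parameter lam is small enough.
    for lam in (0.25, 0.5, 1.0, 2.0, 4.0):
        print(lam, exponent_margin_based(lam), exponent_linear_noise(lam))
```

For the first bound the exponent is larger than 1 exactly when $\lambda < 1$, which is the regime in which a bound of order $1/\epsilon$ is the better one.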

3. Preliminaries 

In multi-view setting, the instances are described with several different disjoint sets of features. For the sake of simplicity, we only consider two-view setting in this paper. Suppose that $X = X_1 \times X_2$ is the instance space, $X_1$ and $X_2$ are the two views, $Y = \{0, 1\}$ is the label space and $\mathcal{D}$ is the distribution over $X \times Y$. Suppose that $c = (c_1, c_2)$ is the optimal Bayes classifier, where $c_1$ and $c_2$ are the optimal Bayes classifiers in the two views, respectively. Let $\mathcal{H}_1$ and $\mathcal{H}_2$ be the hypothesis class in each view and suppose that $c_1 \in \mathcal{H}_1$ and $c_2 \in \mathcal{H}_2$. For any instance $x = (x_1, x_2)$, the hypothesis $h_v \in \mathcal{H}_v$ ($v = 1, 2$) makes that $h_v(x_v) = 1$ if $x_v \in S_v$ and $h_v(x_v) = 0$ otherwise, where $S_v$ is a subset of $X_v$. In this way, any hypothesis $h_v \in \mathcal{H}_v$ corresponds to a subset $S_v$ of $X_v$ (as for how to combine the hypotheses in the two views, see Section 5). Considering that $x_1$ and $x_2$ denote the same instance $x$ in different views, we overload $S_v$ to denote the instance set $\{x = (x_1, x_2) : x_v \in S_v\}$ without confusion. Let $S_v^*$ correspond to the optimal Bayes classifier $c_v$. It is well-known [15] that $S_v^* = \{x_v : \varphi_v(x_v) \ge \frac{1}{2}\}$, where $\varphi_v(x_v) = P(y = 1 | x_v)$. Here, we also overload $S_v^*$ to denote the instance set $\{x = (x_1, x_2) : x_v \in S_v^*\}$. The error rate of a hypothesis $S_v$ under the distribution $\mathcal{D}$ is $R(h_v) = R(S_v) = Pr_{(x_1,x_2,y)\in\mathcal{D}}(y \ne I(x_v \in S_v))$. In general, $R(S_v^*) \ne 0$ and the excess error of $S_v$ can be denoted as follows, where $S_v \Delta S_v^* = (S_v - S_v^*) \cup (S_v^* - S_v)$ and $d(S_v, S_v^*)$ is a pseudo-distance between the sets $S_v$ and $S_v^*$.

$R(S_v) - R(S_v^*) = \int_{S_v \Delta S_v^*} |2\varphi_v(x_v) - 1| \, p_{x_v} \, dx_v \triangleq d(S_v, S_v^*)$   (1)
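As a small numerical illustration of (1), the sketch below works on a finite instance space in a single view. The names `p` (probability mass of each instance) and `phi` (posterior $P(y = 1 | x_v)$) are hypothetical inputs chosen for this example; the code checks that the excess error equals the $|2\varphi_v - 1|$-weighted mass of the symmetric difference with the Bayes set.

```python
def bayes_set(phi):
    """Indices where the posterior P(y=1|x) is at least 1/2 (the set S*)."""
    return {i for i, q in enumerate(phi) if q >= 0.5}


def error_rate(S, p, phi):
    """R(S): inside S we err when y = 0, outside S we err when y = 1."""
    return sum(p[i] * ((1 - phi[i]) if i in S else phi[i])
               for i in range(len(p)))


def excess_error(S, p, phi):
    """d(S, S*): |2*phi - 1| integrated over the symmetric difference."""
    return sum(p[i] * abs(2 * phi[i] - 1) for i in S ^ bayes_set(phi))


def d_delta(S, p, phi):
    """Probability mass of the symmetric difference between S and S*."""
    return sum(p[i] for i in S ^ bayes_set(phi))


if __name__ == "__main__":
    p = [0.25, 0.25, 0.25, 0.25]
    phi = [0.9, 0.6, 0.4, 0.1]
    S = {0, 2}
    print(error_rate(S, p, phi) - error_rate(bayes_set(phi), p, phi))
    print(excess_error(S, p, phi), d_delta(S, p, phi))
```

On this toy distribution the excess error is never larger than the mass of the symmetric difference, since $|2\varphi_v - 1| \le 1$.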

Let $\eta_v$ denote the error rate of the optimal Bayes classifier $c_v$, which is also called the noise rate in the non-realizable case. In general, $\eta_v$ is less than $\frac{1}{2}$. In order to model the noise, we assume that the data distribution and the Bayes decision boundary in each view satisfy the popular Tsybakov noise condition [21] that $Pr_{x_v \in X_v}(|\varphi_v(x_v) - 1/2| \le t) \le C_0 t^\lambda$ for some finite $C_0 > 0$, $\lambda > 0$ and all $0 < t \le 1/2$, where $\lambda = \infty$ corresponds to the best learning situation and the noise is called bounded [8], while $\lambda = 0$ corresponds to the worst situation. When $\lambda < \infty$, the noise is called unbounded [8]. According to Proposition 1 in [21], it is easy to know that (2) holds.

$d(S_v, S_v^*) \ge C_1 d_\Delta^k(S_v, S_v^*)$   (2)

Here $k = \frac{1+\lambda}{\lambda}$, $C_1 = 2C_0^{-1/\lambda}\lambda(\lambda + 1)^{-1-1/\lambda}$, $d_\Delta(S_v, S_v^*) = Pr(S_v - S_v^*) + Pr(S_v^* - S_v)$ is also a pseudo-distance between the sets $S_v$ and $S_v^*$, and $d(S_v, S_v^*) \le d_\Delta(S_v, S_v^*) \le 1$. We will use the following lemma [1], which gives the standard sample complexity for non-realizable learning task.

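Lemma 1 Suppose that $\mathcal{H}$ is a set of functions from $X$ to $Y = \{0, 1\}$ with finite VC-dimension $V \ge 1$ and $\mathcal{D}$ is the fixed but unknown distribution over $X \times Y$. For any $\epsilon, \delta > 0$, there is a positive constant $C$, such that if the size of sample $\{(x^1, y^1), \ldots, (x^N, y^N)\}$ from $\mathcal{D}$ is $N(\epsilon, \delta) = \frac{C}{\epsilon^2}(V + \log(\frac{1}{\delta}))$, then with probability at least $1 - \delta$, for all $h \in \mathcal{H}$, the following holds.

$\left| \frac{1}{N}\sum_{i=1}^{N} I(h(x^i) \ne y^i) - E_{(x,y)\in\mathcal{D}} I(h(x) \ne y) \right| \le \epsilon$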
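The constants in (2) and the sample size in Lemma 1 are simple closed forms, so they can be evaluated directly. The sketch below is only illustrative: the lemma asserts that a suitable constant $C$ exists without giving its value, so `C=1.0` is a placeholder assumption, and the function names are hypothetical.

```python
import math


def tsybakov_constants(C0, lam):
    """Return (k, C1) from the Tsybakov parameters C0 > 0 and lam > 0."""
    k = (1.0 + lam) / lam
    C1 = 2.0 * C0 ** (-1.0 / lam) * lam * (lam + 1.0) ** (-1.0 - 1.0 / lam)
    return k, C1


def sample_size(eps, delta, V, C=1.0):
    """N(eps, delta) = (C / eps^2) * (V + log(1/delta)); C is a placeholder."""
    return math.ceil(C / eps ** 2 * (V + math.log(1.0 / delta)))


if __name__ == "__main__":
    print(tsybakov_constants(C0=1.0, lam=1.0))
    print(sample_size(eps=0.1, delta=0.05, V=3))
```

Halving $\epsilon$ roughly quadruples the sample size, while the dependence on $\delta$ is only logarithmic.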



4. a-Expansion in the Non-realizable Case 

Multi-view active learning, first described in [20], focuses on the contention points (i.e., unlabeled instances on which different views predict different labels) and queries some labels of them. It is motivated by the idea that querying the labels of contention points may help at least one of the two views to learn the optimal classifier. Let $S_1 \oplus S_2 = (S_1 - S_2) \cup (S_2 - S_1)$ denote the contention points between $S_1$ and $S_2$, then $Pr(S_1 \oplus S_2)$ denotes the probability mass on the contention points. "$\Delta$" and "$\oplus$" mean the same operation rule. In this paper, we use "$\Delta$" when referring to the excess error between $S_v$ and $S_v^*$ and use "$\oplus$" when referring to the difference between the two views $S_1$ and $S_2$. In order to study multi-view active learning, the properties of contention points should be considered. One basic property is that $Pr(S_1 \oplus S_2)$ should not be too small, otherwise the two views could be exactly the same and two-view setting would degenerate into single-view setting.

In multi-view learning, the two views represent the same learning task and generally are consistent with each other, i.e., for any instance $x = (x_1, x_2)$ the labels of $x$ in the two views are the same. Hence we first assume that $S_1^* = S_2^* = S^*$. As for the situation where $S_1^* \ne S_2^*$, we will discuss it further in Section 5.2. The instances agreed by the two views can be denoted as $(S_1 \cap S_2) \cup (\overline{S_1} \cap \overline{S_2})$. However, some of these agreed instances may be predicted a different label by the optimal classifier $S^*$, i.e., the instances in $(S_1 \cap S_2 - S^*) \cup (\overline{S_1} \cap \overline{S_2} - \overline{S^*})$. Intuitively, if the contention points can convey some information about $(S_1 \cap S_2 - S^*) \cup (\overline{S_1} \cap \overline{S_2} - \overline{S^*})$, then querying the labels of contention points could help to improve $S_1$ and $S_2$. Based on this intuition and that $Pr(S_1 \oplus S_2)$ should not be too small, we give our definition of $\alpha$-expansion in the non-realizable case.

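Definition 1 $\mathcal{D}$ is $\alpha$-expanding if for some $\alpha > 0$ and any $S_1 \subseteq X_1$, $S_2 \subseteq X_2$, (3) holds.

$Pr(S_1 \oplus S_2) \ge \alpha \left( Pr(S_1 \cap S_2 - S^*) + Pr(\overline{S_1} \cap \overline{S_2} - \overline{S^*}) \right)$   (3)

We say that $\mathcal{D}$ is $\alpha$-expanding with respect to hypothesis class $\mathcal{H}_1 \times \mathcal{H}_2$ if the above holds for all $S_1 \in \mathcal{H}_1 \cap X_1$, $S_2 \in \mathcal{H}_2 \cap X_2$ (here we denote by $\mathcal{H}_v \cap X_v$ the set $\{h \cap X_v : h \in \mathcal{H}_v\}$ for $v = 1, 2$).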
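Inequality (3) can be checked mechanically on a finite instance space. The following is a minimal sketch under that assumption: instances are indexed `0..n-1`, `p[i]` is the mass of instance `i`, and `S1`, `S2`, `S_star` are index sets (the overloaded instance sets of Section 3). All names are illustrative.

```python
def mass(A, p):
    """Probability mass of an index set A."""
    return sum(p[i] for i in A)


def expansion_terms(S1, S2, S_star, p):
    """Return (Pr(S1 xor S2), mass agreed on by both views but wrong)."""
    universe = set(range(len(p)))
    contention = S1 ^ S2
    agreed_positive_wrong = (S1 & S2) - S_star
    # Complement(S1) & complement(S2), minus complement(S*): instances that
    # both views call negative although they lie in S*.
    agreed_negative_wrong = (universe - S1 - S2) & S_star
    wrong = mass(agreed_positive_wrong, p) + mass(agreed_negative_wrong, p)
    return mass(contention, p), wrong


def satisfies_expansion(S1, S2, S_star, p, alpha):
    """Check inequality (3) for one pair (S1, S2)."""
    contention, wrong = expansion_terms(S1, S2, S_star, p)
    return contention >= alpha * wrong


if __name__ == "__main__":
    p = [0.2] * 5
    print(expansion_terms({0, 1, 3}, {0, 2, 3}, {0, 1, 2}, p))
```

Definition 1 requires (3) for every pair of sets in the hypothesis classes, so a full check would loop `satisfies_expansion` over all such pairs.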






Balcan et al. [3] also gave a definition of expansion, $Pr(T_1 \oplus T_2) \ge \alpha \min[Pr(T_1 \cap T_2), Pr(\overline{T_1} \cap \overline{T_2})]$, for realizable learning task under the assumptions that the learner in each view is never "confident but wrong" and the learning algorithm is able to learn from positive data only. Here $T_v$ denotes the instances which are classified as positive confidently in each view. Generally, in realizable learning tasks, we aim at studying the asymptotic performance and assume that the performance of the initial classifier is better than guessing randomly, i.e., $Pr(T_v) > 1/2$. This ensures that $Pr(T_1 \cap T_2)$ is larger than $Pr(\overline{T_1} \cap \overline{T_2})$. In addition, in [3] the instances which are agreed by the two views but are predicted a different label by the optimal classifier can be denoted as $\overline{T_1} \cap \overline{T_2}$. So, it can be found that Definition 1 and the definition of expansion in [3] are based on the same intuition that the amount of contention points is no less than a fraction of the amount of instances which are agreed by the two views but are predicted a different label by the optimal classifiers.

Table 1: Multi-view active learning with the non-degradation condition

Input: Unlabeled data set $\mathcal{U} = \{x^1, x^2, \cdots\}$ where each example $x^j$ is given as a pair $(x_1^j, x_2^j)$
Process:
  Query the labels of $m_0$ instances drawn randomly from $\mathcal{U}$ to compose the labeled data set $\mathcal{L}$
  iterate: $i = 0, 1, \cdots, s$
    Train the classifier $h_v^i$ ($v = 1, 2$) by minimizing the empirical risk with $\mathcal{L}$ in each view:
      $h_v^i = \arg\min_{h \in \mathcal{H}_v} \sum_{(x_1,x_2,y)\in\mathcal{L}} I(h(x_v) \ne y)$;
    Apply $h_1^i$ and $h_2^i$ to the unlabeled data set $\mathcal{U}$ and find out the contention point set $Q_i$;
    Query the labels of $m_{i+1}$ instances drawn randomly from $Q_i$, then add them into $\mathcal{L}$ and delete them from $\mathcal{U}$.
  end iterate
Output: $h_+^s$ and $h_-^s$
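The procedure in Table 1 can be sketched in a few lines once a hypothesis class is fixed. The code below is an illustrative toy, not the authors' implementation: it assumes each view is a single real number, uses threshold classifiers as the hypothesis class, queries the same number `m` of labels in every round, and generates hypothetical data with `make_toy_data`. The two returned classifiers are the combinations that Section 5 calls $h_+$ and $h_-$.

```python
import random


def erm_threshold(labeled, view):
    """Empirical risk minimizer among thresholds h_t(x) = 1[x_view >= t]."""
    values = sorted({x[view] for x, _ in labeled})
    candidates = [float("-inf")] + values + [float("inf")]

    def empirical_risk(t):
        return sum((x[view] >= t) != bool(y) for x, y in labeled)

    return min(candidates, key=empirical_risk)


def multi_view_active_learning(unlabeled, oracle, m, s, rng):
    """Follow Table 1 with m labels per round and rounds i = 0, ..., s."""
    pool = list(unlabeled)
    rng.shuffle(pool)
    labeled = [(x, oracle(x)) for x in pool[:m]]  # initial random queries
    pool = pool[m:]
    for _ in range(s + 1):
        t1 = erm_threshold(labeled, 0)
        t2 = erm_threshold(labeled, 1)
        # Contention points: the two views predict different labels.
        contention = [x for x in pool if (x[0] >= t1) != (x[1] >= t2)]
        queried = rng.sample(contention, min(m, len(contention)))
        labeled += [(x, oracle(x)) for x in queried]
        queried_set = set(queried)
        pool = [x for x in pool if x not in queried_set]

    def h_plus(x):
        return int(x[0] >= t1 and x[1] >= t2)

    def h_minus(x):
        return int(x[0] >= t1 or x[1] >= t2)

    return h_plus, h_minus, len(labeled)


def make_toy_data(n, flip, rng):
    """Hypothetical data: two noisy copies of one score, labels flipped."""
    data = {}
    for _ in range(n):
        u = rng.random()
        x = (u + rng.gauss(0, 0.05), u + rng.gauss(0, 0.05))
        y = int(u >= 0.5)
        data[x] = 1 - y if rng.random() < flip else y
    return data


if __name__ == "__main__":
    rng = random.Random(0)
    data = make_toy_data(400, 0.1, rng)
    _, _, n_labels = multi_view_active_learning(list(data), data.get, 20, 3, rng)
    print(n_labels)
```

As in Table 1, the classifiers returned are those trained in the last round, so the labels queried after that final training step are not used. The constant label flip rate in the toy data is a bounded-noise simplification chosen for brevity.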

5. Multi-view Active Learning with Non-degradation Condition

In this section, we first consider the multi-view learning in Table 1 and analyze whether multi-view setting can help improve the sample complexity of active learning in the non-realizable case remarkably. In multi-view setting, the classifiers are often combined to make predictions and many strategies can be used to combine them. In this paper, we consider the following two combination schemes, $h_+$ and $h_-$, for binary classification:

$h_+^i(x) = \begin{cases} 1 & \text{if } h_1^i(x_1) = h_2^i(x_2) = 1 \\ 0 & \text{otherwise} \end{cases} \qquad h_-^i(x) = \begin{cases} 0 & \text{if } h_1^i(x_1) = h_2^i(x_2) = 0 \\ 1 & \text{otherwise} \end{cases}$   (4)
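The two combination schemes amount to a logical AND and a logical OR of the per-view predictions. A minimal sketch, with `h1` and `h2` standing for arbitrary per-view classifiers that return 0 or 1 (the names are illustrative):

```python
def combine_plus(h1, h2, x):
    """h_+: predict 1 only if both views predict 1."""
    x1, x2 = x
    return 1 if h1(x1) == 1 and h2(x2) == 1 else 0


def combine_minus(h1, h2, x):
    """h_-: predict 0 only if both views predict 0."""
    x1, x2 = x
    return 0 if h1(x1) == 0 and h2(x2) == 0 else 1


if __name__ == "__main__":
    h1 = lambda v: int(v >= 0.5)
    h2 = lambda v: int(v >= 0.3)
    for x in [(0.4, 0.6), (0.6, 0.6), (0.1, 0.1)]:
        print(x, combine_plus(h1, h2, x), combine_minus(h1, h2, x))
```

The two schemes differ only on contention points, where $h_+$ predicts 0 and $h_-$ predicts 1.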

5.1. The Situation Where $S_1^* = S_2^*$

With (4), the error rates of the combined classifiers $h_+^i$ and $h_-^i$ satisfy (5) and (6), respectively.

$R(h_+^i) - R(S^*) = R(S_1^i \cap S_2^i) - R(S^*) \le d_\Delta(S_1^i \cap S_2^i, S^*)$   (5)

$R(h_-^i) - R(S^*) = R(S_1^i \cup S_2^i) - R(S^*) \le d_\Delta(S_1^i \cup S_2^i, S^*)$   (6)

Here $S_v^i \subset X_v$ ($v = 1, 2$) corresponds to the classifier $h_v^i \in \mathcal{H}_v$ in the $i$-th round. In each round of multi-view active learning, labels of some contention points are queried to augment the training data set $\mathcal{L}$ and the classifier in each view is then refined. As discussed in [23], we also assume that the learner in Table 1 satisfies the non-degradation condition as the amount of labeled training examples increases, i.e., (7) holds, which implies that the excess error of $S_v^{i+1}$ is no larger than that of $S_v^i$ in the region of $\overline{S_1^i \oplus S_2^i}$.

$Pr(S_v^{i+1} \Delta S^* \,|\, \overline{S_1^i \oplus S_2^i}) \le Pr(S_v^i \Delta S^* \,|\, \overline{S_1^i \oplus S_2^i})$   (7)

To illustrate the non-degradation condition, we give the following example: Suppose the data in $X_v$ ($v = 1, 2$) fall into $n$ different clusters, denoted by $\pi_1^v, \ldots, \pi_n^v$, and every cluster has the same probability mass for simplicity. The positive class is the union of some clusters while the negative class is the union of the others. Each positive (negative) cluster $\pi_\xi^v$ in $X_v$ is associated with only 3 positive (negative) clusters $\pi_\varsigma^{3-v}$ ($\xi, \varsigma \in \{1, \ldots, n\}$) in $X_{3-v}$ (i.e., given an instance $x_v$ in $\pi_\xi^v$, $x_{3-v}$ will only be in one of these $\pi_\varsigma^{3-v}$). Suppose the learning algorithm will predict all instances in each cluster with the same label, i.e., the hypothesis class $\mathcal{H}_v$ consists of the hypotheses which do not split any cluster. Thus, the cluster $\pi_\xi^v$ can be classified according to the posterior probability $P(y = 1 | \pi_\xi^v)$ and querying the labels of instances in cluster $\pi_\xi^v$ will not influence the estimation of the posterior probability for cluster $\pi_\varsigma^v$ ($\varsigma \ne \xi$). It is evident that the non-degradation condition holds in this task. Note that the non-degradation assumption may not always hold, and we will discuss this in Section 6. Now we give Theorem 1.

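Theorem 1 For data distribution $\mathcal{D}$ $\alpha$-expanding with respect to hypothesis class $\mathcal{H}_1 \times \mathcal{H}_2$ according to Definition 1, when the non-degradation condition holds, if $s = \lceil \frac{\log\frac{1}{8\epsilon}}{\log\frac{1}{C_2}} \rceil$ and $m_i = \frac{256^k C}{C_1^2}\left(V + \log(\frac{16(s+1)}{\delta})\right)$, the multi-view active learning in Table 1 will generate two classifiers $h_+^s$ and $h_-^s$, at least one of which is with error rate no larger than $R(S^*) + \epsilon$ with probability at least $1 - \delta$.

Here, $V = \max[VC(\mathcal{H}_1), VC(\mathcal{H}_2)]$ where $VC(\mathcal{H})$ denotes the VC-dimension of the hypothesis class $\mathcal{H}$, $k = \frac{1+\lambda}{\lambda}$, $C_1 = 2C_0^{-1/\lambda}\lambda(\lambda + 1)^{-1-1/\lambda}$ and $C_2 = \frac{5\alpha+8}{6\alpha+8}$.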



Proof: Let $Q_i = S_1^i \oplus S_2^i$. First we prove that if each view $X_v$ ($v = 1, 2$) satisfies Tsybakov noise condition, i.e., $Pr_{x_v \in X_v}(|\varphi_v(x_v) - 1/2| \le t) \le C_3 t^{\lambda_3}$ for some finite $C_3 > 0$, $\lambda_3 > 0$ and all $0 < t \le 1/2$, Tsybakov noise condition can also be met in $Q_i$, i.e., $\frac{Pr_{x_v \in Q_i}(|\varphi_v(x_v) - 1/2| \le t)}{Pr(Q_i)} \le C_4 t^{\lambda_4}$ for some finite $C_4 > 0$, $\lambda_4 > 0$ and all $0 < t \le 1/2$. Suppose Tsybakov noise condition cannot be met in $Q_i$, then for $C_* = \frac{C_3}{Pr(Q_i)}$ and $\lambda_* = \lambda_3$, there exists some $0 < t_* \le 1/2$ to satisfy that

$Pr_{x_v \in X_v}(|\varphi_v(x_v) - 1/2| \le t_*) \ge Pr_{x_v \in Q_i}(|\varphi_v(x_v) - 1/2| \le t_*) > C_3 t_*^{\lambda_3}$.

This contradicts the assumption that $X_v$ satisfies Tsybakov noise condition. Thus, we get that Tsybakov noise condition can also be met in $Q_i$. Without loss of generality, suppose that Tsybakov noise condition in all $Q_i$ and $X_v$ can be met for the same finite $C_0$ and $\lambda$.

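Since $m_0 = \frac{256^k C}{C_1^2}\left(V + \log(\frac{16(s+1)}{\delta})\right)$, according to Lemma 1 we know that $d(S_v^0, S^*) \le \frac{C_1}{16^k}$ with probability at least $1 - \frac{\delta}{16(s+1)}$. With $d(S_v^0, S^*) \ge C_1 d_\Delta^k(S_v^0, S^*)$, we get $d_\Delta(S_v^0, S^*) \le \frac{1}{16}$. It is easy to find that $d_\Delta(S_1^0 \cap S_2^0, S^*) \le d_\Delta(S_1^0, S^*) + d_\Delta(S_2^0, S^*) \le 1/8$ holds with probability at least $1 - \frac{\delta}{8(s+1)}$.

For $i \ge 0$, $m_{i+1}$ labels are queried randomly from $Q_i$. Thus, similarly according to Lemma 1 we have $d_\Delta(S_1^{i+1} \cap S_2^{i+1} | Q_i, S^* | Q_i) \le 1/8$ with probability at least $1 - \frac{\delta}{8(s+1)}$. Let $T_v^{i+1} = S_v^{i+1} \cap \overline{Q_i}$ and $\tau_{i+1} = \frac{Pr(T_1^{i+1} \oplus T_2^{i+1} - S^*)}{Pr(T_1^{i+1} \oplus T_2^{i+1})} - \frac{1}{2}$, it is easy to get

$Pr(S^* \cap (S_1^{i+1} \oplus S_2^{i+1}) | \overline{Q_i}) - Pr(\overline{S^*} \cap (S_1^{i+1} \oplus S_2^{i+1}) | \overline{Q_i}) = -2\tau_{i+1} Pr(S_1^{i+1} \oplus S_2^{i+1} | \overline{Q_i})$.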

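Considering the non-degradation condition and $d_\Delta(S_1^i \cap S_2^i | \overline{Q_i}, S^* | \overline{Q_i}) = d_\Delta(S_v^i | \overline{Q_i}, S^* | \overline{Q_i})$, we calculate that

$d_\Delta(S_1^{i+1} \cap S_2^{i+1} | \overline{Q_i}, S^* | \overline{Q_i})$

$= \frac{1}{2}\left[d_\Delta(S_1^{i+1} | \overline{Q_i}, S^* | \overline{Q_i}) + d_\Delta(S_2^{i+1} | \overline{Q_i}, S^* | \overline{Q_i})\right] + \frac{1}{2}Pr(S^* \cap (S_1^{i+1} \oplus S_2^{i+1}) | \overline{Q_i}) - \frac{1}{2}Pr(\overline{S^*} \cap (S_1^{i+1} \oplus S_2^{i+1}) | \overline{Q_i})$

$\le \frac{1}{2}\left[d_\Delta(S_1^i | \overline{Q_i}, S^* | \overline{Q_i}) + d_\Delta(S_2^i | \overline{Q_i}, S^* | \overline{Q_i})\right] - \tau_{i+1} Pr(S_1^{i+1} \oplus S_2^{i+1} | \overline{Q_i})$

$= d_\Delta(S_1^i \cap S_2^i | \overline{Q_i}, S^* | \overline{Q_i}) - \tau_{i+1} Pr(S_1^{i+1} \oplus S_2^{i+1} | \overline{Q_i})$.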



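So we have

$d_\Delta(S_1^{i+1} \cap S_2^{i+1}, S^*)$

$= d_\Delta(S_1^{i+1} \cap S_2^{i+1} | Q_i, S^* | Q_i) Pr(Q_i) + d_\Delta(S_1^{i+1} \cap S_2^{i+1} | \overline{Q_i}, S^* | \overline{Q_i}) Pr(\overline{Q_i})$

$\le \frac{1}{8} Pr(Q_i) + d_\Delta(S_1^i \cap S_2^i | \overline{Q_i}, S^* | \overline{Q_i}) Pr(\overline{Q_i}) - \tau_{i+1} Pr((S_1^{i+1} \oplus S_2^{i+1}) \cap \overline{Q_i})$.

Considering $d_\Delta(S_1^i \cap S_2^i | \overline{Q_i}, S^* | \overline{Q_i}) Pr(\overline{Q_i}) = Pr(S_1^i \cap S_2^i - S^*) + Pr(\overline{S_1^i} \cap \overline{S_2^i} - \overline{S^*})$, we have

$d_\Delta(S_1^{i+1} \cap S_2^{i+1}, S^*) \le Pr(S_1^i \cap S_2^i - S^*) + Pr(\overline{S_1^i} \cap \overline{S_2^i} - \overline{S^*}) + \frac{1}{8} Pr(S_1^i \oplus S_2^i) - \tau_{i+1} Pr((S_1^{i+1} \oplus S_2^{i+1}) \cap \overline{Q_i})$.

Similarly, we get

$d_\Delta(S_1^{i+1} \cup S_2^{i+1}, S^*) \le Pr(S_1^i \cap S_2^i - S^*) + Pr(\overline{S_1^i} \cap \overline{S_2^i} - \overline{S^*}) + \frac{1}{8} Pr(S_1^i \oplus S_2^i) + \tau_{i+1} Pr((S_1^{i+1} \oplus S_2^{i+1}) \cap \overline{Q_i})$.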



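Let $\gamma_i = \frac{Pr(S_1^i \oplus S_2^i - S^*)}{Pr(S_1^i \oplus S_2^i)} - \frac{1}{2}$, we have

$d_\Delta(S_1^i \cap S_2^i, S^*) = d_\Delta(S_1^i \cap S_2^i | Q_i, S^* | Q_i) Pr(Q_i) + d_\Delta(S_1^i \cap S_2^i | \overline{Q_i}, S^* | \overline{Q_i}) Pr(\overline{Q_i})$
$= (1/2 - \gamma_i) Pr(S_1^i \oplus S_2^i) + Pr(S_1^i \cap S_2^i - S^*) + Pr(\overline{S_1^i} \cap \overline{S_2^i} - \overline{S^*})$

and $d_\Delta(S_1^i \cup S_2^i, S^*) = (1/2 + \gamma_i) Pr(S_1^i \oplus S_2^i) + Pr(S_1^i \cap S_2^i - S^*) + Pr(\overline{S_1^i} \cap \overline{S_2^i} - \overline{S^*})$.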
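The two decompositions of $d_\Delta$ for the intersection and the union of the views are set identities, so they can be sanity-checked on a finite instance space. The sketch below assumes such a space (index sets and a mass vector `p`, all hypothetical names) and a non-empty contention region, and returns both sides of each identity.

```python
def check_decomposition(S1, S2, S_star, p):
    """Return (lhs_inter, rhs_inter, lhs_union, rhs_union) for the identities."""
    universe = set(range(len(p)))

    def mass(A):
        return sum(p[i] for i in A)

    Q = S1 ^ S2  # contention region, assumed to have positive mass
    gamma = mass(Q - S_star) / mass(Q) - 0.5
    # Mass on which both views agree but disagree with S*.
    wrong = mass((S1 & S2) - S_star) + mass((universe - S1 - S2) & S_star)
    lhs_inter = mass((S1 & S2) ^ S_star)
    lhs_union = mass((S1 | S2) ^ S_star)
    rhs_inter = (0.5 - gamma) * mass(Q) + wrong
    rhs_union = (0.5 + gamma) * mass(Q) + wrong
    return lhs_inter, rhs_inter, lhs_union, rhs_union


if __name__ == "__main__":
    print(check_decomposition({0, 1, 3}, {0, 2, 3}, {0, 1, 2}, [0.2] * 5))
```

Both identities hold because, on the contention region, the intersection predicts negative and the union predicts positive, while on the agreement region the two coincide with either view.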

As in each round of the multi-view active learning some contention points of the two views are queried and added into the training set, the difference between the two views is decreasing, i.e., $Pr(S_1^{i+1} \oplus S_2^{i+1})$ is no larger than $Pr(S_1^i \oplus S_2^i)$.

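Case 1: If $|\tau_{i+1}| \le \gamma_i$, with respect to Definition 1, we have

$\frac{d_\Delta(S_1^{i+1} \cup S_2^{i+1}, S^*)}{d_\Delta(S_1^i \cup S_2^i, S^*)} \le \frac{\frac{1}{\alpha}Pr(S_1^i \oplus S_2^i) + |\tau_{i+1}| Pr(S_1^{i+1} \oplus S_2^{i+1}) + \frac{1}{8}Pr(S_1^i \oplus S_2^i)}{(\frac{1}{2} + \gamma_i) Pr(S_1^i \oplus S_2^i) + \frac{1}{\alpha}Pr(S_1^i \oplus S_2^i)} \le \frac{(\frac{1}{8} + \gamma_i) Pr(S_1^i \oplus S_2^i) + \frac{1}{\alpha}Pr(S_1^i \oplus S_2^i)}{(\frac{1}{2} + \gamma_i) Pr(S_1^i \oplus S_2^i) + \frac{1}{\alpha}Pr(S_1^i \oplus S_2^i)} \le \frac{5\alpha + 8}{8\alpha + 8}$.

Case 2: If $-|\tau_{i+1}| \ge \gamma_i$, with respect to Definition 1, we have

$\frac{d_\Delta(S_1^{i+1} \cap S_2^{i+1}, S^*)}{d_\Delta(S_1^i \cap S_2^i, S^*)} \le \frac{\frac{1}{\alpha}Pr(S_1^i \oplus S_2^i) + |\tau_{i+1}| Pr(S_1^{i+1} \oplus S_2^{i+1}) + \frac{1}{8}Pr(S_1^i \oplus S_2^i)}{(\frac{1}{2} + |\gamma_i|) Pr(S_1^i \oplus S_2^i) + \frac{1}{\alpha}Pr(S_1^i \oplus S_2^i)} \le \frac{5\alpha + 8}{8\alpha + 8}$.

Case 3: If $\tau_{i+1} > \gamma_i$ and $0 < \gamma_i \le \frac{1}{4}$, with respect to Definition 1, we have

$\frac{d_\Delta(S_1^{i+1} \cap S_2^{i+1}, S^*)}{d_\Delta(S_1^i \cap S_2^i, S^*)} \le \frac{\frac{1}{\alpha}Pr(S_1^i \oplus S_2^i) + \frac{1}{8}Pr(S_1^i \oplus S_2^i)}{(\frac{1}{2} - \gamma_i) Pr(S_1^i \oplus S_2^i) + \frac{1}{\alpha}Pr(S_1^i \oplus S_2^i)} \le \frac{\alpha + 8}{2\alpha + 8}$.

Case 4: If $\tau_{i+1} > \gamma_i$ and $\frac{1}{4} < \gamma_i \le \frac{1}{2}$, with respect to Definition 1, we have

$\frac{d_\Delta(S_1^{i+1} \cup S_2^{i+1}, S^*)}{d_\Delta(S_1^i \cup S_2^i, S^*)} \le \frac{\frac{1}{\alpha}Pr(S_1^i \oplus S_2^i) + |\tau_{i+1}| Pr(S_1^{i+1} \oplus S_2^{i+1}) + \frac{1}{8}Pr(S_1^i \oplus S_2^i)}{(\frac{1}{2} + \gamma_i) Pr(S_1^i \oplus S_2^i) + \frac{1}{\alpha}Pr(S_1^i \oplus S_2^i)} \le \frac{5\alpha + 8}{6\alpha + 8}$.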

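Case 5: If $\tau_{i+1} < \gamma_i$ and $-\frac{1}{4} \le \gamma_i < 0$, with respect to Definition 1, we have

$\frac{d_\Delta(S_1^{i+1} \cup S_2^{i+1}, S^*)}{d_\Delta(S_1^i \cup S_2^i, S^*)} \le \frac{\frac{1}{\alpha}Pr(S_1^i \oplus S_2^i) + \frac{1}{8}Pr(S_1^i \oplus S_2^i)}{(\frac{1}{2} + \gamma_i) Pr(S_1^i \oplus S_2^i) + \frac{1}{\alpha}Pr(S_1^i \oplus S_2^i)} \le \frac{\alpha + 8}{2\alpha + 8}$.

Case 6: If $\tau_{i+1} < \gamma_i$ and $-\frac{1}{2} \le \gamma_i < -\frac{1}{4}$, with respect to Definition 1, we have

$\frac{d_\Delta(S_1^{i+1} \cap S_2^{i+1}, S^*)}{d_\Delta(S_1^i \cap S_2^i, S^*)} \le \frac{\frac{1}{\alpha}Pr(S_1^i \oplus S_2^i) + |\tau_{i+1}| Pr(S_1^{i+1} \oplus S_2^{i+1}) + \frac{1}{8}Pr(S_1^i \oplus S_2^i)}{(\frac{1}{2} + |\gamma_i|) Pr(S_1^i \oplus S_2^i) + \frac{1}{\alpha}Pr(S_1^i \oplus S_2^i)} \le \frac{5\alpha + 8}{6\alpha + 8}$.

Case 7: If $\tau_{i+1} < -\gamma_i$ and $0 < \gamma_i \le \frac{1}{2}$, with respect to Definition 1, we have

$\frac{d_\Delta(S_1^{i+1} \cup S_2^{i+1}, S^*)}{d_\Delta(S_1^i \cup S_2^i, S^*)} \le \frac{\frac{1}{\alpha}Pr(S_1^i \oplus S_2^i) + \frac{1}{8}Pr(S_1^i \oplus S_2^i)}{(\frac{1}{2} + \gamma_i) Pr(S_1^i \oplus S_2^i) + \frac{1}{\alpha}Pr(S_1^i \oplus S_2^i)} \le \frac{\alpha + 8}{4\alpha + 8}$.

Case 8: If $\tau_{i+1} > -\gamma_i$ and $-\frac{1}{2} \le \gamma_i < 0$, with respect to Definition 1, we have

$\frac{d_\Delta(S_1^{i+1} \cap S_2^{i+1}, S^*)}{d_\Delta(S_1^i \cap S_2^i, S^*)} \le \frac{\frac{1}{\alpha}Pr(S_1^i \oplus S_2^i) + \frac{1}{8}Pr(S_1^i \oplus S_2^i)}{(\frac{1}{2} + |\gamma_i|) Pr(S_1^i \oplus S_2^i) + \frac{1}{\alpha}Pr(S_1^i \oplus S_2^i)} \le \frac{\alpha + 8}{4\alpha + 8}$.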

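Thus, after the $(i+1)$-th round, either $\frac{d_\Delta(S_1^{i+1} \cap S_2^{i+1}, S^*)}{d_\Delta(S_1^i \cap S_2^i, S^*)} \le \frac{5\alpha+8}{6\alpha+8}$ or $\frac{d_\Delta(S_1^{i+1} \cup S_2^{i+1}, S^*)}{d_\Delta(S_1^i \cup S_2^i, S^*)} \le \frac{5\alpha+8}{6\alpha+8}$ holds. Hence, we have $d_\Delta(S_1^s \cap S_2^s, S^*) \le \frac{1}{8}(\frac{5\alpha+8}{6\alpha+8})^s$ or $d_\Delta(S_1^s \cup S_2^s, S^*) \le \frac{1}{8}(\frac{5\alpha+8}{6\alpha+8})^s$ with probability at least $1 - \delta$. When $s = \lceil \frac{\log\frac{1}{8\epsilon}}{\log\frac{1}{C_2}} \rceil$, where $C_2 = \frac{5\alpha+8}{6\alpha+8}$ is a constant less than 1, we have either $d_\Delta(S_1^s \cap S_2^s, S^*) \le \epsilon$ or $d_\Delta(S_1^s \cup S_2^s, S^*) \le \epsilon$ with probability at least $1 - \delta$. Thus, considering $R(h_+^s) - R(S^*) = R(S_1^s \cap S_2^s) - R(S^*) \le d_\Delta(S_1^s \cap S_2^s, S^*)$ and $R(h_-^s) - R(S^*) = R(S_1^s \cup S_2^s) - R(S^*) \le d_\Delta(S_1^s \cup S_2^s, S^*)$, we have either $R(h_+^s) \le R(S^*) + \epsilon$ or $R(h_-^s) \le R(S^*) + \epsilon$. □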
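The geometric contraction at the end of the proof is what yields a number of rounds that grows only logarithmically in $1/\epsilon$. The sketch below is illustrative: `case_bounds` lists the four distinct ratio bounds that appear in Cases 1 to 8, and `rounds_needed` counts how many contractions by the largest of them bring the initial bound $\frac{1}{8}$ below $\epsilon$. Function names are hypothetical.

```python
def case_bounds(alpha):
    """The four distinct per-round ratio bounds from the case analysis."""
    return [
        (5 * alpha + 8) / (8 * alpha + 8),  # Cases 1 and 2
        (alpha + 8) / (2 * alpha + 8),      # Cases 3 and 5
        (5 * alpha + 8) / (6 * alpha + 8),  # Cases 4 and 6
        (alpha + 8) / (4 * alpha + 8),      # Cases 7 and 8
    ]


def rounds_needed(alpha, eps):
    """Smallest s >= 0 such that (1/8) * C2**s <= eps, C2 the worst ratio."""
    c2 = max(case_bounds(alpha))
    s, bound = 0, 1.0 / 8.0
    while bound > eps:
        bound *= c2
        s += 1
    return s


if __name__ == "__main__":
    for eps in (1e-1, 1e-2, 1e-3, 1e-4):
        print(eps, rounds_needed(2.0, eps))
```

For every $\alpha > 0$ the largest of the four bounds is $\frac{5\alpha+8}{6\alpha+8}$, which is why this value serves as the contraction constant.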



From Theorem 1 we know that we only need to request $\sum_{i=0}^{s} m_i = \widetilde{O}(\log\frac{1}{\epsilon})$ labels to learn $h_+^s$ and $h_-^s$, at least one of which is with error rate no larger than $R(S^*) + \epsilon$ with probability at least $1 - \delta$. If we choose $h_+^s$ and it happens to satisfy $R(h_+^s) \le R(S^*) + \epsilon$, we can get a classifier whose error rate is no larger than $R(S^*) + \epsilon$. Fortunately, there are only two classifiers and the probability of getting the right classifier is no less than $\frac{1}{2}$. To study how to choose between $h_+^s$ and $h_-^s$, we give Definition 2 at first.

Definition 2 The multi-view classifiers $S_1$ and $S_2$ satisfy $\beta$-condition if (8) holds for some $\beta > 0$.

$\left| \frac{Pr(\{x : x \in S_1 \oplus S_2 \wedge y(x) = 1\})}{Pr(S_1 \oplus S_2)} - \frac{Pr(\{x : x \in S_1 \oplus S_2 \wedge y(x) = 0\})}{Pr(S_1 \oplus S_2)} \right| \ge \beta$   (8)

(8) requires that, in the contention region $S_1 \oplus S_2$, the proportions of examples belonging to the positive class and to the negative class differ by at least $\beta$. Based on Definition 2, we give Lemma 2, which provides information for deciding how to choose between $h_+$ and $h_-$. This helps to get Theorem 2.

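Lemma 2 If the multi-view classifiers $S_1^s$ and $S_2^s$ satisfy $\beta$-condition, with the number of $\frac{2\log(\frac{4}{\delta})}{\beta^2}$ labels we can decide correctly whether $Pr(\{x : x \in S_1^s \oplus S_2^s \wedge y(x) = 1\})$ or $Pr(\{x : x \in S_1^s \oplus S_2^s \wedge y(x) = 0\})$ is smaller with probability at least $1 - \delta$.

Proof: We apply $S_1^s$ and $S_2^s$ to the unlabeled instance set and identify the contention point set. Then we query for labels of $\frac{2\log(\frac{4}{\delta})}{\beta^2}$ instances drawn randomly from the contention point set. With these labels we estimate the empirical value $P_1$ of $\frac{Pr(\{x : x \in S_1^s \oplus S_2^s \wedge y(x) = 1\})}{Pr(S_1^s \oplus S_2^s)}$ and the empirical value $P_2$ of $\frac{Pr(\{x : x \in S_1^s \oplus S_2^s \wedge y(x) = 0\})}{Pr(S_1^s \oplus S_2^s)}$. By Chernoff bound, with the number of $\frac{2\log(\frac{4}{\delta})}{\beta^2}$ labels we have the following two relations with probability at least $1 - \delta$.

$P_1 \in \left( \frac{Pr(\{x : x \in S_1^s \oplus S_2^s \wedge y(x) = 1\})}{Pr(S_1^s \oplus S_2^s)} - \frac{\beta}{2}, \; \frac{Pr(\{x : x \in S_1^s \oplus S_2^s \wedge y(x) = 1\})}{Pr(S_1^s \oplus S_2^s)} + \frac{\beta}{2} \right)$

$P_2 \in \left( \frac{Pr(\{x : x \in S_1^s \oplus S_2^s \wedge y(x) = 0\})}{Pr(S_1^s \oplus S_2^s)} - \frac{\beta}{2}, \; \frac{Pr(\{x : x \in S_1^s \oplus S_2^s \wedge y(x) = 0\})}{Pr(S_1^s \oplus S_2^s)} + \frac{\beta}{2} \right)$

If $P_1 \le P_2$, we get $Pr(\{x : x \in S_1^s \oplus S_2^s \wedge y(x) = 1\}) \le Pr(\{x : x \in S_1^s \oplus S_2^s \wedge y(x) = 0\})$ with probability at least $1 - \delta$; otherwise, we get $Pr(\{x : x \in S_1^s \oplus S_2^s \wedge y(x) = 1\}) > Pr(\{x : x \in S_1^s \oplus S_2^s \wedge y(x) = 0\})$ with probability at least $1 - \delta$. □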
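The decision procedure in the proof of Lemma 2 can be sketched as a simple sampling routine. This is an illustration under stated assumptions: the label count $2\log(4/\delta)/\beta^2$ used here is what a two-sided Hoeffding bound with a union bound over the two estimates gives, and is assumed to be the count intended in the lemma; `contention_labels` is a hypothetical non-empty list of the 0/1 labels the oracle would return on the contention points.

```python
import math
import random


def labels_needed(beta, delta):
    """Labels for estimating each class proportion to within beta/2."""
    return math.ceil(2.0 * math.log(4.0 / delta) / beta ** 2)


def choose_combination(contention_labels, beta, delta, rng):
    """Return 'h_plus' if positives look rarer on the contention region."""
    n = min(labels_needed(beta, delta), len(contention_labels))
    queried = rng.sample(contention_labels, n)
    p1 = sum(queried) / n  # empirical proportion of positive labels
    p2 = 1.0 - p1          # empirical proportion of negative labels
    # h_plus predicts 0 on contention points, so it is preferred when the
    # positive class carries less mass there.
    return "h_plus" if p1 <= p2 else "h_minus"


if __name__ == "__main__":
    rng = random.Random(0)
    labels = [1] * 300 + [0] * 700
    print(labels_needed(0.4, 0.05), choose_combination(labels, 0.4, 0.05, rng))
```

The number of queried labels depends only on $\beta$ and $\delta$, not on $\epsilon$, which is why this step does not change the overall order of the label complexity.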

Theorem 2 For data distribution $\mathcal{D}$ $\alpha$-expanding with respect to hypothesis class $\mathcal{H}_1 \times \mathcal{H}_2$ according to Definition 1, when the non-degradation condition holds, if the multi-view classifiers satisfy $\beta$-condition, by requesting $\widetilde{O}(\log\frac{1}{\epsilon})$ labels the multi-view active learning in Table 1 will generate a classifier whose error rate is no larger than $R(S^*) + \epsilon$ with probability at least $1 - \delta$.

Proof: According to Theorem 1, by requesting $\widetilde{O}(\log\frac{1}{\epsilon})$ labels the multi-view active learning in Table 1 can get either $R(h_+^s) \le R(S^*) + \epsilon$ or $R(h_-^s) \le R(S^*) + \epsilon$ with probability at least $1 - \frac{\delta}{2}$. According to Lemma 2, by requesting $\frac{2\log(\frac{8}{\delta})}{\beta^2}$ labels we can decide correctly whether $Pr(\{x : x \in S_1^s \oplus S_2^s \wedge y(x) = 1\})$ or $Pr(\{x : x \in S_1^s \oplus S_2^s \wedge y(x) = 0\})$ is smaller with probability at least $1 - \frac{\delta}{2}$.

Case 1: If $Pr(\{x : x \in S_1^s \oplus S_2^s \wedge y(x) = 1\}) \le Pr(\{x : x \in S_1^s \oplus S_2^s \wedge y(x) = 0\})$, we have $R(h_+^s) \le R(h_-^s)$. Thus, we get $R(h_+^s) \le R(S^*) + \epsilon$ with probability at least $1 - \delta$.

Case 2: If $Pr(\{x : x \in S_1^s \oplus S_2^s \wedge y(x) = 1\}) > Pr(\{x : x \in S_1^s \oplus S_2^s \wedge y(x) = 0\})$, we have $R(h_-^s) < R(h_+^s)$. Thus, we get $R(h_-^s) \le R(S^*) + \epsilon$ with probability at least $1 - \delta$.

The total number of labels to be requested is $\widetilde{O}(\log\frac{1}{\epsilon}) + \frac{2\log(\frac{8}{\delta})}{\beta^2} = \widetilde{O}(\log\frac{1}{\epsilon})$. □



From Theorem 2 we know that we only need to request $\widetilde{O}(\log\frac{1}{\epsilon})$ labels to learn a classifier with error rate no larger than $R(S^*) + \epsilon$ with probability at least $1 - \delta$. Thus, we achieve an exponential improvement in sample complexity of active learning in the non-realizable case under multi-view setting. Sometimes, the difference between the examples belonging to the positive class and those belonging to the negative class in $S_1^s \oplus S_2^s$ may be very small, i.e., (9) holds.

$\left| \frac{Pr(\{x : x \in S_1^s \oplus S_2^s \wedge y(x) = 1\})}{Pr(S_1^s \oplus S_2^s)} - \frac{Pr(\{x : x \in S_1^s \oplus S_2^s \wedge y(x) = 0\})}{Pr(S_1^s \oplus S_2^s)} \right| = O(\epsilon)$   (9)

If so, we need not estimate whether $R(h_+^s)$ or $R(h_-^s)$ is smaller, and Theorem 3 indicates that both $h_+^s$ and $h_-^s$ are good approximations of the optimal classifier.

Theorem 3 For data distribution $\mathcal{D}$ $\alpha$-expanding with respect to hypothesis class $\mathcal{H}_1 \times \mathcal{H}_2$ according to Definition 1, when the non-degradation condition holds, if (9) is satisfied, by requesting $\widetilde{O}(\log\frac{1}{\epsilon})$ labels the multi-view active learning in Table 1 will generate two classifiers $h_+^s$ and $h_-^s$ which satisfy either (a) or (b) with probability at least $1 - \delta$. (a) $R(h_+^s) \le R(S^*) + \epsilon$ and $R(h_-^s) \le R(S^*) + O(\epsilon)$; (b) $R(h_+^s) \le R(S^*) + O(\epsilon)$ and $R(h_-^s) \le R(S^*) + \epsilon$.



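Proof: Since $Pr(S_1^s \oplus S_2^s) \le 1$, with the following equation

$\left| \frac{Pr(\{x : x \in S_1^s \oplus S_2^s \wedge y(x) = 1\})}{Pr(S_1^s \oplus S_2^s)} - \frac{Pr(\{x : x \in S_1^s \oplus S_2^s \wedge y(x) = 0\})}{Pr(S_1^s \oplus S_2^s)} \right| = O(\epsilon)$

we have $|Pr(\{x : x \in S_1^s \oplus S_2^s \wedge y(x) = 1\}) - Pr(\{x : x \in S_1^s \oplus S_2^s \wedge y(x) = 0\})| = O(\epsilon)$. So it is easy to get $|R(h_+^s) - R(h_-^s)| = O(\epsilon)$. According to Theorem 1, by requesting $\widetilde{O}(\log\frac{1}{\epsilon})$ labels we can get either $R(h_+^s) \le R(S^*) + \epsilon$ or $R(h_-^s) \le R(S^*) + \epsilon$ with probability at least $1 - \delta$. Thus, we get that $h_+^s$ and $h_-^s$ satisfy either (a) or (b) with probability at least $1 - \delta$. □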



5.2. The Situation Where $S_1^* \ne S_2^*$

Although the two views represent the same learning task and generally are consistent with each other, sometimes $S_1^*$ may not be equal to $S_2^*$. Therefore, the $\alpha$-expansion assumption in Definition 1 should be adjusted to the situation where $S_1^* \ne S_2^*$. To analyze this theoretically, we replace $S^*$ by $S_1^* \cap S_2^*$ in Definition 1 and get (10). Similarly to Theorem 1, we get Theorem 4.

$Pr(S_1 \oplus S_2) \ge \alpha \left( Pr(S_1 \cap S_2 - S_1^* \cap S_2^*) + Pr(\overline{S_1} \cap \overline{S_2} - \overline{S_1^* \cap S_2^*}) \right)$   (10)



Theorem 4 For data distribution $\mathcal{D}$ $\alpha$-expanding with respect to hypothesis class $\mathcal{H}_1 \times \mathcal{H}_2$ according to (10), when the non-degradation condition holds, if $s = \lceil \frac{\log\frac{1}{8\epsilon}}{\log\frac{1}{C_2}} \rceil$ and $m_i = \frac{256^k C}{C_1^2}\left(V + \log(\frac{16(s+1)}{\delta})\right)$, the multi-view active learning in Table 1 will generate two classifiers $h_+^s$ and $h_-^s$, at least one of which is with error rate no larger than $R(S_1^* \cap S_2^*) + \epsilon$ with probability at least $1 - \delta$. ($V$, $k$, $C_1$ and $C_2$ are given in Theorem 1.)

Proof: Since $S_v^*$ is the optimal Bayes classifier in the $v$-th view, obviously, $R(S_1^* \cap S_2^*)$ is no less than $R(S_v^*)$ ($v = 1, 2$). So, learning a classifier with error rate no larger than $R(S_1^* \cap S_2^*) + \epsilon$ is not harder than learning a classifier with error rate no larger than $R(S_v^*) + \epsilon$. Now we aim at learning a classifier with error rate no larger than $R(S_1^* \cap S_2^*) + \epsilon$. Without loss of generality, we assume $R(S_v^i) > R(S_1^* \cap S_2^*)$ for $i = 0, 1, \ldots, s$. If $R(S_v^i) \le R(S_1^* \cap S_2^*)$, we get a classifier with error rate no larger than $R(S_1^* \cap S_2^*) + \epsilon$. Thus, we can neglect the probability mass on the hypothesis whose error rate is less than $R(S_1^* \cap S_2^*)$ and regard $S_1^* \cap S_2^*$ as the optimal. Replacing $S^*$ by $S_1^* \cap S_2^*$ in the discussion of Section 5.1, with the proof of Theorem 1 we get Theorem 4 proved. □

Theorem 4 shows that for the situation where $S_1^* \ne S_2^*$, by requesting $\widetilde{O}(\log\frac{1}{\epsilon})$ labels we can learn two classifiers $h_+^s$ and $h_-^s$, at least one of which is with error rate no larger than $R(S_1^* \cap S_2^*) + \epsilon$ with probability at least $1 - \delta$. With Lemma 2, we get Theorem 5 from Theorem 4.

Theorem 5 For data distribution V a-expanding with respect to hypothesis class %i x 7^2 o-c- 
cording to jiO)) . when the non- degradation condition holds, if the multi-view classifiers satisfy 
^-condition, by requesting O(log-) labels the multi-view active learning in Table 1 will generate 
a classifier whose error rate is no larger than R{Sl n S^) + e with probability at least 1 — 5. 

Proof: According to Theorem 4, by requesting $O(\log\frac{1}{\epsilon})$ labels the multi-view active learning in
Table 1 can get either $R(h_+^s) \le R(S_1^* \cap S_2^*) + \epsilon$ or $R(h_-^s) \le R(S_1^* \cap S_2^*) + \epsilon$ with probability at
least $1 - \frac{\delta}{2}$. According to Lemma 2, by requesting the number of labels specified there we can decide
correctly whether $Pr(\{x : x \in S_1^s \oplus S_2^s \wedge y(x) = 1\})$ or $Pr(\{x : x \in S_1^s \oplus S_2^s \wedge y(x) = 0\})$ is smaller
with probability at least $1 - \frac{\delta}{2}$.

Case 1: If $Pr(\{x : x \in S_1^s \oplus S_2^s \wedge y(x) = 1\}) \le Pr(\{x : x \in S_1^s \oplus S_2^s \wedge y(x) = 0\})$, we have
$R(h_-^s) \le R(h_+^s)$. Thus, we get $R(h_-^s) \le R(S_1^* \cap S_2^*) + \epsilon$ with probability at least $1 - \delta$.

Case 2: If $Pr(\{x : x \in S_1^s \oplus S_2^s \wedge y(x) = 1\}) > Pr(\{x : x \in S_1^s \oplus S_2^s \wedge y(x) = 0\})$, we have
$R(h_+^s) < R(h_-^s)$. Thus, we get $R(h_+^s) \le R(S_1^* \cap S_2^*) + \epsilon$ with probability at least $1 - \delta$.

The total number of labels to be requested is $O(\log\frac{1}{\epsilon})$ plus the number of labels required by
Lemma 2, which is again $O(\log\frac{1}{\epsilon})$. □
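The selection step used in this proof can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the `oracle` callable and the majority-vote rule are hypothetical, and we assume that $h_-^s$ labels the disagreement region $S_1^s \oplus S_2^s$ as 0 while $h_+^s$ labels it as 1, which matches Case 1 and Case 2 above.

```python
import random

def choose_combined_classifier(disagreement_points, oracle, n_queries, rng=None):
    """Pick h_minus or h_plus by querying labels inside the disagreement region.

    disagreement_points: examples on which the two views disagree (assumed non-empty).
    oracle: hypothetical labeling function returning 0 or 1.
    n_queries: number of labels to request (Lemma 2 gives the required number).
    """
    rng = rng or random.Random(0)
    # Draw examples at random from the disagreement region and query their labels.
    sample = [rng.choice(disagreement_points) for _ in range(n_queries)]
    positives = sum(oracle(x) for x in sample)
    negatives = n_queries - positives
    # Case 1: positives are not more frequent, so predicting 0 there (h_minus) is no worse.
    # Case 2: positives dominate, so predicting 1 there (h_plus) is better.
    return "h_minus" if positives <= negatives else "h_plus"
```

The number of queries is independent of how accurately the two views were trained, which is why this step does not change the order of the label complexity.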



Generally, $R(S_1^* \cap S_2^*)$ is larger than $R(S_1^*)$ and $R(S_2^*)$. When $S_1^*$ is not too much different from $S_2^*$,
i.e., $Pr(S_1^* \oplus S_2^*) \le \epsilon/2$, we have Corollary 1 which indicates that the exponential improvement
in the sample complexity of active learning with Tsybakov noise is still possible.

Corollary 1 For data distribution $\mathcal{D}$ $\alpha$-expanding with respect to hypothesis class $\mathcal{H}_1 \times \mathcal{H}_2$ according
to (10), when the non-degradation condition holds, if the multi-view classifiers satisfy
$\beta$-condition and $Pr(S_1^* \oplus S_2^*) \le \epsilon/2$, by requesting $O(\log\frac{1}{\epsilon})$ labels the multi-view active learning
in Table 1 will generate a classifier with error rate no larger than $R(S_v^*) + \epsilon$ ($v = 1, 2$) with
probability at least $1 - \delta$.

Proof: According to Theorem 5, we know that by requesting $O(\log\frac{1}{\epsilon})$ labels the multi-view active
learning in Table 1 will generate a classifier whose error rate is no larger than $R(S_1^* \cap S_2^*) + \frac{\epsilon}{2}$
with probability at least $1 - \delta$. Considering that

$$R(S_1^* \cap S_2^*) - R(S_v^*) = \int_{(S_1^* \cap S_2^*) \Delta S_v^*} |2\varphi_v(x_v) - 1| \, p_{x_v} \, dx_v \le Pr(S_1^* \oplus S_2^*),$$

we have $R(S_1^* \cap S_2^*) \le R(S_v^*) + \frac{\epsilon}{2}$. Thus, we get that $R(S_1^* \cap S_2^*) + \frac{\epsilon}{2}$ is no larger than $R(S_v^*) + \epsilon$.

□
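The inequality used in the proof of Corollary 1 can be checked numerically on a small discrete example. The sketch below is a simplification and not part of the original analysis: it assumes a single finite instance space with one shared posterior `eta[x] = Pr(y = 1 | x)`, and the sets `S1` and `S2` are arbitrary illustrative choices rather than true Bayes classifiers of two views.

```python
def risk(S, p, eta):
    """Error rate of the classifier that predicts 1 on S and 0 elsewhere."""
    return sum(p[x] * ((1 - eta[x]) if x in S else eta[x]) for x in p)

# Hypothetical toy distribution: probability mass p and posterior eta per point.
p = {0: 0.4, 1: 0.3, 2: 0.2, 3: 0.1}
eta = {0: 0.9, 1: 0.7, 2: 0.4, 3: 0.1}
S1, S2 = {0, 1}, {0, 2}

intersection = S1 & S2
disagreement_mass = sum(p[x] for x in S1 ^ S2)  # Pr(S1 xor S2)

# Risk gap between the intersection classifier and each individual classifier.
gaps = [risk(intersection, p, eta) - risk(S, p, eta) for S in (S1, S2)]
```

In this example both gaps stay below the mass of the disagreement region, as the bound requires, because the intersection differs from each set only inside that region.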



6. Multi-view Active Learning without Non-degradation Condition

Section 5 considers situations where the non-degradation condition holds. There are cases, however,
where the non-degradation condition (7) does not hold. In this section we focus on the multi-view active
learning in Table 2 and give an analysis with the non-degradation condition waived. Firstly,
we give Theorem 6 for the sample complexity of multi-view active learning in Table 2 when
$S_1^* = S_2^* = S^*$.

Theorem 6 For data distribution $\mathcal{D}$ $\alpha$-expanding with respect to hypothesis class $\mathcal{H}_1 \times \mathcal{H}_2$ according
to Definition 1, if s = \- — ^] and mi = ^^3^(1/ -|- log( — ^^ — -)) , the multi-view active
learning in Table 2 will generate two classifiers $h_+^s$ and $h_-^s$, at least one of which is with error
rate no larger than $R(S^*) + \epsilon$ with probability at least $1 - \delta$. ($V$, $k$, $C_1$ and $C_2$ are given in
Theorem 1.)

Proof: After the $i$-th round in Table 2, the number of training examples in $\mathcal{L}$ is
$\sum_{b=0}^{i} 2^b m_i = (2^{i+1} - 1)m_i$, while in the $(i+1)$-th round we randomly query $(2^{i+1} - 1)m_i$ labels from the
region of $Q_i$ and add them into $\mathcal{L}$. So in the $(i+1)$-th round, the number of training examples
for $S_v^{i+1}$ ($v = 1, 2$) drawn randomly from region of $Q_i$ is larger than the number of whole training
examples for $S_v^i$. Since the optimal Bayes classifier $c_v$ belongs to $\mathcal{H}_v$, according to the standard
PAC-model, it is easy to know that $d(S_v^{i+1}|Q_i, S_v^*|Q_i) \le d(S_v^i|Q_i, S_v^*|Q_i)$ can be met for any $\varphi_v$,
where $d(S_v|Q_i, S_v^*|Q_i)$ is defined as

$$d(S_v|Q_i, S_v^*|Q_i) = R(S_v|Q_i) - R(S_v^*|Q_i) = \int_{(S_v \cap Q_i) \Delta (S_v^* \cap Q_i)} |2\varphi_v(x_v) - 1| \, p_{x_v} \, dx_v \, / \, Pr(Q_i).$$

So, by setting $\varphi_v \in \{0, 1\}$, we get $d_\Delta(S_v^{i+1}|Q_i, S_v^*|Q_i) \le d_\Delta(S_v^i|Q_i, S_v^*|Q_i)$, which implies the
non-degradation condition. Thus, with the proof of Theorem 1 we get Theorem 6 proved. □

Table 2: Multi-view active learning without the non-degradation condition

Input: Unlabeled data set $\mathcal{U} = \{x^1, x^2, \cdots\}$ where each example $x^j$ is given as a pair $(x_1^j, x_2^j)$
Process:
    Query the labels of $m_0$ instances drawn randomly from $\mathcal{U}$ to compose the labeled data set $\mathcal{L}$;
    Train the classifier $h_v^0$ ($v = 1, 2$) by minimizing the empirical risk with $\mathcal{L}$ in each view:
        $h_v^0 = \arg\min_{h \in \mathcal{H}_v} \sum_{(x_1, x_2, y) \in \mathcal{L}} I(h(x_v) \ne y)$;
    iterate: $i = 1, \cdots, s$
        Apply $h_1^{i-1}$ and $h_2^{i-1}$ to the unlabeled data set $\mathcal{U}$ and find out the contention point set $Q_i$;
        Query the labels of $m_i$ instances drawn randomly from $Q_i$, then add them into $\mathcal{L}$ and delete them
        from $\mathcal{U}$;
        Query the labels of $(2^i - 1)m_i$ instances drawn randomly from $\mathcal{U} - Q_i$, then add them into $\mathcal{L}$ and
        delete them from $\mathcal{U}$;
        Train the classifier $h_v^i$ by minimizing the empirical risk with $\mathcal{L}$ in each view:
            $h_v^i = \arg\min_{h \in \mathcal{H}_v} \sum_{(x_1, x_2, y) \in \mathcal{L}} I(h(x_v) \ne y)$.
    end iterate
Output: $h_+^s$ and $h_-^s$
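The procedure of Table 2 can be sketched in Python as follows. This is a minimal illustration under several assumptions that are not taken from the paper: each view is a single real number, the hypothesis class is a set of one-dimensional threshold classifiers, the same number `m` of labels is used for every $m_i$, and $h_+$ and $h_-$ are taken to be the OR and AND combinations of the two view classifiers. All function names are hypothetical.

```python
import random

def erm_threshold(labeled, view):
    """Empirical risk minimization over thresholds h_t(x) = 1 if x >= t (toy class)."""
    xs = sorted({ex[view] for ex, _ in labeled})
    candidates = [float("-inf")] + xs + [float("inf")]
    def errors(t):
        return sum(int((ex[view] >= t) != bool(y)) for ex, y in labeled)
    return min(candidates, key=errors)

def predict(t, x):
    return int(x >= t)

def query(indices, k, pool, oracle, labeled, unlabeled, rng):
    """Label up to k random examples from `indices` and move them from U to L."""
    chosen = rng.sample(sorted(indices), min(k, len(indices)))
    for j in chosen:
        labeled.append((pool[j], oracle(pool[j])))
        unlabeled.discard(j)

def multiview_active_learning(pool, oracle, s, m, seed=0):
    rng = random.Random(seed)
    unlabeled, labeled = set(range(len(pool))), []
    query(unlabeled, m, pool, oracle, labeled, unlabeled, rng)  # initial labels
    h = [erm_threshold(labeled, v) for v in (0, 1)]
    for i in range(1, s + 1):
        # Contention points: unlabeled examples on which the two views disagree.
        Q = {j for j in unlabeled
             if predict(h[0], pool[j][0]) != predict(h[1], pool[j][1])}
        rest = unlabeled - Q
        query(Q, m, pool, oracle, labeled, unlabeled, rng)
        query(rest, (2 ** i - 1) * m, pool, oracle, labeled, unlabeled, rng)
        h = [erm_threshold(labeled, v) for v in (0, 1)]
    return h, labeled

def h_plus(h, ex):   # assumed combination: predict 1 if either view predicts 1
    return int(predict(h[0], ex[0]) or predict(h[1], ex[1]))

def h_minus(h, ex):  # assumed combination: predict 1 only if both views predict 1
    return int(predict(h[0], ex[0]) and predict(h[1], ex[1]))
```

When the contention set holds fewer than `m` unlabeled examples, this sketch simply queries all of them. That is a practical shortcut for finite pools, not something the table specifies.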



Theorem 6 shows that we can request $\sum_{i=0}^{s} 2^i m_i = O(\frac{1}{\epsilon})$ labels to learn two classifiers $h_+^s$ and
$h_-^s$, at least one of which has error rate no larger than $R(S^*) + \epsilon$ with probability at least
$1 - \delta$. To guarantee the non-degradation condition (7), we only need to query $(2^i - 1)m_i$ more
labels in the $i$-th round. With Lemma 2 we get Theorem 7.
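The label budget in this discussion follows from a geometric sum. The short Python sketch below illustrates it under the assumption that the same value `m` is used for every $m_i$. The function names are hypothetical.

```python
def round_budget(i, m):
    """Labels queried in round i: m from the contention set, (2^i - 1) * m outside it."""
    return m, (2 ** i - 1) * m

def total_labels(s, m):
    """Total labels over rounds 0..s, i.e. the sum of 2^i * m."""
    return sum(sum(round_budget(i, m)) for i in range(s + 1))

# The geometric sum gives the closed form (2^(s+1) - 1) * m.
assert total_labels(3, 10) == (2 ** 4 - 1) * 10
```

Each round doubles the number of queried labels, so the last round accounts for roughly half of the total.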

Theorem 7 For data distribution $\mathcal{D}$ $\alpha$-expanding with respect to hypothesis class $\mathcal{H}_1 \times \mathcal{H}_2$ according
to Definition 1, if the multi-view classifiers satisfy $\beta$-condition, by requesting $O(\frac{1}{\epsilon})$ labels
the multi-view active learning in Table 2 will generate a classifier whose error rate is no larger
than $R(S^*) + \epsilon$ with probability at least $1 - \delta$.

Proof: According to Theorem 6, by requesting $O(\frac{1}{\epsilon})$ labels the multi-view active learning in
Table 2 will generate two classifiers $h_+^s$ and $h_-^s$, at least one of which has error rate no larger
than $R(S^*) + \epsilon$ with probability at least $1 - \delta$. Similarly to the proof of Theorem 2, we get
Theorem 7 proved. □

Theorem 7 shows that, without the non-degradation condition, we need to request $O(\frac{1}{\epsilon})$ labels
to learn a classifier with error rate no larger than $R(S^*) + \epsilon$ with probability at least $1 - \delta$. The
order of $1/\epsilon$ is independent of the parameter in Tsybakov noise. Similarly to Theorem 3, we
get Theorem 8 which indicates that both $h_+^s$ and $h_-^s$ are good approximations of the optimal
classifier.

Theorem 8 For data distribution $\mathcal{D}$ $\alpha$-expanding with respect to hypothesis class $\mathcal{H}_1 \times \mathcal{H}_2$ according
to Definition 1, if ^ holds, by requesting $O(\frac{1}{\epsilon})$ labels the multi-view active learning in
Table 2 will generate two classifiers $h_+^s$ and $h_-^s$ which satisfy either (a) or (b) with probability at
least $1 - \delta$. (a) $R(h_+^s) \le R(S^*) + \epsilon$ and $R(h_-^s) \le R(S^*) + O(\epsilon)$; (b) $R(h_+^s) \le R(S^*) + O(\epsilon)$ and
$R(h_-^s) \le R(S^*) + \epsilon$.

Proof: According to Theorem 6, by requesting $O(\frac{1}{\epsilon})$ labels the multi-view active learning in
Table 2 will generate two classifiers $h_+^s$ and $h_-^s$, at least one of which has error rate no larger
than $R(S^*) + \epsilon$ with probability at least $1 - \delta$. Similarly to the proof of Theorem 3, we get
Theorem 8 proved. □

As for the situation where $S_1^* \ne S_2^*$, similarly to Theorem 5 and Corollary 1, we have Theorem 9
and Corollary 2.

Theorem 9 For data distribution $\mathcal{D}$ $\alpha$-expanding with respect to hypothesis class $\mathcal{H}_1 \times \mathcal{H}_2$ according
to (10), if the multi-view classifiers satisfy $\beta$-condition, by requesting $O(\frac{1}{\epsilon})$ labels the
multi-view active learning in Table 2 will generate a classifier whose error rate is no larger than
$R(S_1^* \cap S_2^*) + \epsilon$ with probability at least $1 - \delta$.

Proof: Similarly to the proof of Theorem 4 and Theorem 6, we know that by requesting
$O(\frac{1}{\epsilon})$ labels the multi-view active learning in Table 2 can get either $R(h_+^s) \le R(S_1^* \cap S_2^*) + \epsilon$ or
$R(h_-^s) \le R(S_1^* \cap S_2^*) + \epsilon$ with probability at least $1 - \frac{\delta}{2}$. According to Lemma 2, by requesting
the number of labels specified there we can decide correctly whether $R(h_+^s)$ or $R(h_-^s)$ is smaller
with probability at least $1 - \frac{\delta}{2}$. Thus, we can get a classifier whose error rate is no larger than
$R(S_1^* \cap S_2^*) + \epsilon$ with probability at least $1 - \delta$. The total number of labels to be requested is
$O(\frac{1}{\epsilon})$ plus the number of labels required by Lemma 2, which is again $O(\frac{1}{\epsilon})$. □



Corollary 2 For data distribution $\mathcal{D}$ $\alpha$-expanding with respect to hypothesis class $\mathcal{H}_1 \times \mathcal{H}_2$ according
to (10), if the multi-view classifiers satisfy $\beta$-condition and $Pr(S_1^* \oplus S_2^*) \le \epsilon/2$, by requesting
$O(\frac{1}{\epsilon})$ labels the multi-view active learning in Table 2 will generate a classifier with error rate no
larger than $R(S_v^*) + \epsilon$ ($v = 1, 2$) with probability at least $1 - \delta$.

Proof: According to Theorem 9, we know that by requesting $O(\frac{1}{\epsilon})$ labels the multi-view active
learning in Table 2 will generate a classifier whose error rate is no larger than $R(S_1^* \cap S_2^*) + \frac{\epsilon}{2}$
with probability at least $1 - \delta$. With the proof of Corollary 1, we get that $R(S_1^* \cap S_2^*) + \frac{\epsilon}{2}$ is no
larger than $R(S_v^*) + \epsilon$. □



7. Empirical Verification 

In this section we empirically verify whether the multi-view setting can remarkably improve the
sample complexity of active learning in the non-realizable case.



In the experiments we use the semi-artificial data set [2(| and the course data set [6]. The
semi-artificial data set has two artificial views, which are created by randomly pairing two examples
from the same class, and contains 800 examples. In order to control the correlation between
the two views, the number of clusters per class can be set as a parameter. We use 1 cluster,
2 clusters and 4 clusters in the experiments, respectively. The course data set has two natural
views: pages view (i.e., the text appearing on the page) and links view (i.e., the anchor text
attached to hyper-links pointing to the page) and contains 1,051 examples. We randomly select
25% of the data as the test set and use the remaining 75% to generate the unlabeled data set $\mathcal{U}$.
We use Random Sampling as the baseline. In each round, we fix the number of examples to be
queried in Multi-View Active Learning and that in Random Sampling. Thus, we can study their
performances under the same number of queried examples. In the experiments, we query two
examples in each round of the two methods and implement the classifiers with NaiveBayes in
WEKA. The experiments are repeated for 20 runs and Figure 1 plots the average error rates of
the two methods against the number of examples that have been queried. From Figure 1 it can be
found that the performance of Multi-View Active Learning is far better than the performance of
Random Sampling with the same number of queried examples. In other words, the multi-view setting
can help improve the sample complexity of active learning in the non-realizable case remarkably.

Figure 1: Multi-view setting improves the sample complexity of active learning in the non-realizable case
remarkably.



8. Conclusion 

We present the first study on active learning in the non-realizable case under the multi-view setting
in this paper. We prove that the sample complexity of multi-view active learning with unbounded
Tsybakov noise can be improved to $O(\log\frac{1}{\epsilon})$, in contrast to the single-view setting where only
polynomial improvement is proved possible with the same noise condition. In the general multi-view
setting, we prove that the sample complexity of active learning with unbounded Tsybakov noise
is $O(\frac{1}{\epsilon})$, where the order of $1/\epsilon$ is independent of the parameter in Tsybakov noise, in contrast
to previous polynomial bounds where the order of $1/\epsilon$ is related to the parameter in Tsybakov
noise. Generally, the non-realizability of a learning task can be caused by many kinds of noise, e.g.,
misclassification noise and malicious noise. It would be interesting to extend our work to more
general noise models.